Unsupervised Learning: Trade&Ahead¶

Marks: 60

Context¶

The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, create wealth, and provide some tax benefits. Steady returns on investments over a long period can grow far beyond what seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can build for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to judge a stock's worth, and doing so for a multitude of stocks to identify the right picks can be a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics and stocks that exhibit minimal correlation with one another. This helps investors analyze stocks across different market segments and protect against risks that could leave the portfolio vulnerable to losses.

Objective¶

Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. The firm has hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. You have been assigned the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

Data Dictionary¶

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price in 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Estimated number of shares of the company's stock currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share to its book value per share (the book value of a company is the net difference between its total assets and total liabilities)
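Several of these indicators are simple ratios of one another. A minimal sketch of how they relate, using made-up figures that are not taken from the dataset (note the dataset appears to store ROE as a whole-number percentage, while the raw ratio below is a fraction):

```python
# Illustrative figures only (not from the dataset)
net_income = 4_000_000            # revenues minus expenses, interest, and taxes
shareholders_equity = 20_000_000  # assets minus debt
shares_outstanding = 2_000_000
current_price = 30.0

roe = net_income / shareholders_equity             # return on equity
eps = net_income / shares_outstanding              # earnings per share
pe_ratio = current_price / eps                     # price-to-earnings
book_value_per_share = shareholders_equity / shares_outstanding
pb_ratio = current_price / book_value_per_share    # price-to-book

print(roe, eps, pe_ratio, pb_ratio)  # 0.2 2.0 15.0 3.0
```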

Importing necessary libraries and data¶

In [1]:
#for data manipulation
import numpy as np
import pandas as pd

#for data visualization
import seaborn as sns
import matplotlib.pyplot as plt

#for statistics
import scipy.stats as stats

#for scaling data with z-score
from sklearn.preprocessing import StandardScaler

#for calculating distances
from scipy.spatial.distance import cdist

#for k-means clustering
from sklearn.cluster import KMeans
#for calculating silhouette scores
from sklearn.metrics import silhouette_score
#for graphing and displaying silhouette score and elbow curve
#installing yellowbrick
!pip install yellowbrick
from yellowbrick.cluster import SilhouetteVisualizer, KElbowVisualizer

#for calculating distances
from scipy.spatial.distance import pdist

#for hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
#for making dendrograms and calculating cophenetic correlation
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet

#for ignoring warnings
import warnings
warnings.filterwarnings('ignore')
Requirement already satisfied: yellowbrick in c:\users\neha\anaconda3\lib\site-packages (1.5)
In [2]:
#importing the data from csv file into a dataframe called stocks
stocks = pd.read_csv('stock_data.csv')

Data Overview¶

In [3]:
#first five rows of stocks
stocks.head()
Out[3]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 AAL American Airlines Group Industrials Airlines 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.39 6.681299e+08 3.718174 -8.784219
1 ABBV AbbVie Health Care Pharmaceuticals 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.15 1.633016e+09 18.806350 -8.750068
2 ABT Abbott Laboratories Health Care Health Care Equipment 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.94 1.504422e+09 15.275510 -0.394171
3 ADBE Adobe Systems Inc Information Technology Application Software 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.26 4.996437e+08 74.555557 4.199651
4 ADI Analog Devices, Inc. Information Technology Semiconductors 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.31 2.247994e+09 178.451613 1.059810
In [4]:
#last five rows of stocks
stocks.tail()
Out[4]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
335 YHOO Yahoo Inc. Information Technology Internet Software & Services 33.259998 14.887727 1.845149 15 459 -1032187000 -4359082000 -4.64 939457327.6 28.976191 6.261775
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 435353535.4 17.682214 -3.838260
337 ZBH Zimmer Biomet Holdings Health Care Health Care Equipment 102.589996 9.347683 1.404206 1 100 376000000 147000000 0.78 188461538.5 131.525636 -23.884449
338 ZION Zions Bancorp Financials Regional Banks 27.299999 -1.158588 1.468176 4 99 -43623000 309471000 1.20 257892500.0 22.749999 -0.063096
339 ZTS Zoetis Health Care Pharmaceuticals 47.919998 16.678836 1.610285 32 65 272000000 339000000 0.68 498529411.8 70.470585 1.723068

Observations:¶

  • Each row seems to represent a stock in the stock market.
  • The ticker symbol seems to be unique for each row; its primary purpose is identification of the stock.
In [5]:
#viewing number of rows and columns in stocks
stocks.shape
Out[5]:
(340, 15)

Observations:¶

There are 340 rows and 15 columns in stocks.

In [6]:
#viewing stocks column datatypes and further information 
stocks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB

Observations:¶

  • Ticker Symbol, Security, GICS Sector, and GICS Sub Industry are all categorical variables and object datatypes.
  • Current Price, Price Change, Volatility, Earnings Per Share, Estimated Shares Outstanding, P/E Ratio, and P/B Ratio are all numerical variables and float datatypes.
  • ROE, Cash Ratio, Net Cash Flow, and Net Income are all numerical variables and integer datatypes.
  • All of the columns have 340 non-null values each, so there seem to be no missing values.
In [7]:
#viewing the statistical summary of all of the stocks columns
stocks.describe(include='all')
Out[7]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
count 340 340 340 340 340.000000 340.000000 340.000000 340.000000 340.000000 3.400000e+02 3.400000e+02 340.000000 3.400000e+02 340.000000 340.000000
unique 340 340 11 104 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top AAL American Airlines Group Industrials Oil & Gas Exploration & Production NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq 1 1 53 16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN NaN NaN 80.862345 4.078194 1.525976 39.597059 70.023529 5.553762e+07 1.494385e+09 2.776662 5.770283e+08 32.612563 -1.718249
std NaN NaN NaN NaN 98.055086 12.006338 0.591798 96.547538 90.421331 1.946365e+09 3.940150e+09 6.587779 8.458496e+08 44.348731 13.966912
min NaN NaN NaN NaN 4.500000 -47.129693 0.733163 1.000000 0.000000 -1.120800e+10 -2.352800e+10 -61.200000 2.767216e+07 2.935451 -76.119077
25% NaN NaN NaN NaN 38.555000 -0.939484 1.134878 9.750000 18.000000 -1.939065e+08 3.523012e+08 1.557500 1.588482e+08 15.044653 -4.352056
50% NaN NaN NaN NaN 59.705000 4.819505 1.385593 15.000000 47.000000 2.098000e+06 7.073360e+08 2.895000 3.096751e+08 20.819876 -1.067170
75% NaN NaN NaN NaN 92.880001 10.695493 1.695549 27.000000 99.000000 1.698108e+08 1.899000e+09 4.620000 5.731175e+08 31.764755 3.917066
max NaN NaN NaN NaN 1274.949951 55.051683 4.580042 917.000000 958.000000 2.076400e+10 2.444200e+10 50.090000 6.159292e+09 528.039074 129.064585

Observations:¶

  • As observed in the beginning, Ticker Symbol is unique for each row, and it turns out that Security is also unique for each row.
  • There are 11 unique values in GICS Sector, and 104 unique values in GICS Sub Industry.
  • The average current price of stocks is around 80.86 dollars. It ranges from 4.50 dollars to almost 1,275 dollars.
  • The average percentage change in stock prices over 13 weeks is around 4.08 percent. It ranges from around -47.13 to 55.05 percent.
  • The average standard deviation in stock prices over 13 weeks is around 1.53. It ranges from 0.73 to 4.58.
  • All of the columns have a total of 340 values each.
In [8]:
#viewing duplicate values in stocks
stocks.duplicated().sum()
Out[8]:
0

Observations:¶

There are no duplicate values in stocks.

In [9]:
#viewing missing values in stocks
stocks.isnull().sum()
Out[9]:
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64

Observations:¶

There are no missing values in stocks.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

In [10]:
#creating histogram for every variable in stocks
#color is set to thistle
#excluding Ticker Symbol and Security columns because it is known they are unique for each row
for i in stocks.columns: #for each column in stocks
    if i in ('Ticker Symbol', 'Security'): #skip columns known to be unique for each row
        continue
    sns.histplot(data=stocks, x=i, color='thistle') #plot the column on the x-axis
    plt.title(i) #title is column name
    plt.grid(False) #do not display grid
    if i == 'GICS Sector': #rotate x-axis labels 90 degrees
        plt.xticks(rotation=90)
    elif i == 'GICS Sub Industry': #rotate labels and shrink the font so all sub-industries fit
        plt.xticks(fontsize=4, rotation=90)
    plt.show() #display histogram

Observations:¶

  • Industrials is the GICS sector with the highest number of stocks, and Telecommunications Services is the sector with the lowest.
  • The stocks are very unevenly distributed among GICS sub-industries: some have only one stock, while the largest has 16.
  • The data for current price is somewhat skewed to the right. Most stock prices are in the 0-100 dollar range; very few stocks are over 1,200 dollars.
  • The data for price change is roughly bell-shaped, though not perfectly symmetrical.
  • The data for volatility is skewed to the right, ranging from below 1 standard deviation to almost 4.6 standard deviations.
  • The data for ROE, cash ratio, estimated shares outstanding, and P/E ratio is also skewed to the right, while net cash flow, net income, earnings per share, and P/B ratio roughly resemble a bell curve.

Bivariate Analysis¶

In [11]:
#adding the numerical columns of stocks to a variable called stocks_num
stocks_num = stocks.select_dtypes(np.number)
#using stocks_num columns to make heatmap
#labels are limited to 2 decimal places/range is from -1 to 1
sns.heatmap(stocks_num.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
In [12]:
#using stocks_num columns to make pairplot, including kde on the diagonals
sns.pairplot(data=stocks_num, diag_kind='kde')
plt.show(); #displaying pairplot

Observations:¶

  • There seems to be no strong negative or positive correlation among any variables.
  • The highest negative correlation seen in stocks is -0.41, and it is seen between price change and volatility. It is also seen between ROE and earnings per share.
  • The highest positive correlation seen in stocks is 0.59, and it is seen between estimated shares outstanding and net income.

Questions¶

1) What does the distribution of stock prices look like?

In [13]:
#creating histogram using data from stocks, plotting Current Price on the x-axis
#setting color to thistle
sns.histplot(data=stocks, x='Current Price', color='thistle')
plt.title('Distribution of Current Stock Prices') #setting title of histogram
plt.xlabel('Current Stock Price') #setting title of x-axis
plt.ylabel('Number of Stocks') #setting title of y-axis
plt.grid(False) #not displaying grid
plt.show(); #displaying histogram

Observations:¶

  • The data for current stock prices is slightly skewed to the right.
  • Most of the stocks are between 0 and 200 dollars.
  • Very few stocks are over 1,200 dollars.

2) The stocks of which economic sector have seen the maximum price increase on average?

In [14]:
#organizing stocks by GICS sector
#finding the mean price change for each GICS sector
#sorting the means in descending order
stocks.groupby('GICS Sector')['Price Change'].mean().sort_values(ascending=False)
Out[14]:
GICS Sector
Health Care                     9.585652
Consumer Staples                8.684750
Information Technology          7.217476
Telecommunications Services     6.956980
Real Estate                     6.205548
Consumer Discretionary          5.846093
Materials                       5.589738
Financials                      3.865406
Industrials                     2.833127
Utilities                       0.803657
Energy                        -10.228289
Name: Price Change, dtype: float64

Observations:¶

The maximum average price increase is seen in the Health Care sector.

3) How are the different variables correlated with each other?

In [15]:
#viewing the correlation heatmap for numerical columns in stocks
#labels are limited to 2 decimal places/range is from -1 to 1
sns.heatmap(stocks_num.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap

Observations:¶

  • There is no strong correlation among the different variables in stocks.
  • The highest positive correlation is 0.59, and it is between estimated shares outstanding and net income.
  • The highest negative correlation is -0.41. It is seen between ROE and earnings per share, and between price change and volatility.

4) Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?

In [16]:
#organizing stocks by GICS sector
#finding the mean cash ratio for each GICS sector
#sorting the means in descending order
stocks.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)
Out[16]:
GICS Sector
Information Technology         149.818182
Telecommunications Services    117.000000
Health Care                    103.775000
Financials                      98.591837
Consumer Staples                70.947368
Energy                          51.133333
Real Estate                     50.111111
Consumer Discretionary          49.575000
Materials                       41.700000
Industrials                     36.188679
Utilities                       13.625000
Name: Cash Ratio, dtype: float64

Observations:¶

  • The largest average cash ratios are seen in the Information Technology, Telecommunications Services, and Health Care sectors, in that order.
  • The smallest average cash ratios are seen in the Utilities, Industrials, and Materials sectors, in that order.

5) P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

In [17]:
#organizing stocks by GICS sector
#finding the mean P/E ratio for each GICS sector
#sorting the means in descending order
stocks.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
Out[17]:
GICS Sector
Energy                         72.897709
Information Technology         43.782546
Real Estate                    43.065585
Health Care                    41.135272
Consumer Discretionary         35.211613
Consumer Staples               25.521195
Materials                      24.585352
Utilities                      18.719412
Industrials                    18.259380
Financials                     16.023151
Telecommunications Services    12.222578
Name: P/E Ratio, dtype: float64

Observations:¶

  • The largest average P/E ratios are seen in the Energy, Information Technology, and Real Estate sectors, in that order.
  • The smallest average P/E ratios are seen in the Telecommunications Services, Financials, and Industrials sectors, in that order.

Data Preprocessing¶

Checking for Duplicate or Missing Values¶

In [18]:
#viewing duplicate values in stocks
stocks.duplicated().sum()
Out[18]:
0
In [19]:
#viewing missing values in stocks
stocks.isnull().sum()
Out[19]:
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64

Observations:¶

There are no missing or duplicate values in stocks.

Outlier Detection¶

In [20]:
#making boxplots for all numerical columns in stocks, using previously created variable stocks_num
for i in stocks_num.columns: #for each column in stocks_num
    sns.boxplot(data=stocks_num, x=i) #make boxplot by plotting column on x-axis
    plt.title(i) #title of boxplot is column name
    plt.grid(False) #not displaying grid
    plt.show(); #displaying boxplot

Observations:¶

Most of the variables have outliers, but these outliers represent authentic data points. It is best to keep them as they are.
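The boxplot whiskers above follow the usual 1.5×IQR rule, so "outliers" here means points beyond those fences. A minimal sketch of counting such points for one toy column (the values are illustrative, not from stocks):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 100.0])  # toy data with one extreme point

q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker fences
outliers = values[(values < lower) | (values > upper)]

print(outliers)  # only the extreme point falls outside the fences
```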

Feature Engineering¶

The variables do not need to be changed. The numerical variables have already been isolated into another dataframe called stocks_num. These columns need to be scaled before clustering.
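Scaling matters here because k-means relies on Euclidean distance, so a feature measured in billions (such as net income) would completely dominate one measured in single digits (such as volatility). A minimal sketch with made-up numbers:

```python
import numpy as np

# two toy stocks: very different volatility, modestly different net income
a = np.array([1.5, 2.0e9])  # [volatility, net income in dollars]
b = np.array([4.5, 2.1e9])

raw_dist = np.linalg.norm(a - b)  # dominated entirely by the income axis

# z-score each feature across the points, then compare distances again
stacked = np.vstack([a, b])
scaled = (stacked - stacked.mean(axis=0)) / stacked.std(axis=0)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])

print(raw_dist, scaled_dist)  # after scaling, both features contribute equally
```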

Scaling the Data¶

In [21]:
#assigning the StandardScaler to a variable called scaler
scaler = StandardScaler()
#copying the stocks_num dataframe into a variable called num_columns
num_columns = stocks_num.copy()
#scaling num_columns using the scaler variable, an instance of StandardScaler()
#saving the scaled values in the scaled_values variable
scaled_values = scaler.fit_transform(num_columns)

#adding the values into a new dataframe scaled_stocks
#scaled_stocks will contain scaled_values and the columns from num_columns
scaled_stocks = pd.DataFrame(scaled_values, columns=num_columns.columns)
#viewing the first five rows of scaled_stocks 
scaled_stocks.head()
Out[21]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196

EDA after Data Preprocessing¶

In [22]:
#creating histogram for every variable in scaled_stocks
#color is set to thistle
for i in scaled_stocks.columns: #for each column in scaled_stocks
        sns.histplot(data=scaled_stocks, x=i, color='thistle') #plot column on x-axis
        plt.title(i) #title is column name
        plt.grid(False) #do not display grid
        plt.show(); #display histogram
In [23]:
#using scaled_stocks columns to make heatmap
#labels are limited to 2 decimal places/range is from -1 to 1
sns.heatmap(scaled_stocks.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap

Observations:¶

There are no major changes in the distribution of data in each column or correlation between different variables after data preprocessing and scaling.
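That is expected: z-scoring is a linear transform of each column (subtract the mean, divide by the standard deviation), so it shifts and rescales a distribution without changing its shape, and it leaves Pearson correlations unchanged. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 500)        # synthetic feature
y = 2 * x + rng.normal(0, 5, 500)  # correlated synthetic feature

# z-score each column by hand
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# correlation is invariant under per-column z-scoring
print(np.corrcoef(x, y)[0, 1], np.corrcoef(zx, zy)[0, 1])
```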

K-means Clustering¶

Finding Mean Distortions for Different Numbers of Clusters¶

In [24]:
#setting clusters from 2 to 10 (not including 10)
num_clusters = range(2, 10)
#creating an empty list for mean_distortions
mean_distortions = [] 

for k in num_clusters: #for each value in the range of num_clusters
    model = KMeans(n_clusters=k) #assigning model variable the KMeans function with number of clusters equal to value
    model.fit(scaled_stocks) #fit the scaled_stocks data to KMeans function
    prediction = model.predict(scaled_stocks) #make prediction for values in scaled_stocks
    distortion = (sum(np.min(cdist(scaled_stocks, model.cluster_centers_, 'euclidean'), axis=1))
                 /scaled_stocks.shape[0]) #find distortion using euclidean distances
    mean_distortions.append(distortion) #add distortion value to the mean_distortions list
    #print the value and the corresponding mean distortion
    print(k, 'Clusters', 'Mean Distortion:', distortion) 
2 Clusters Mean Distortion: 2.382318498894466
3 Clusters Mean Distortion: 2.2692367155390745
4 Clusters Mean Distortion: 2.179645269703779
5 Clusters Mean Distortion: 2.1129944992818515
6 Clusters Mean Distortion: 2.0565797933792824
7 Clusters Mean Distortion: 2.0307068651453446
8 Clusters Mean Distortion: 1.9666240276860545
9 Clusters Mean Distortion: 1.9274833859398008

Using Elbow Method to Choose the Ideal Number of Clusters¶

In [25]:
#plotting the number of clusters with their corresponding mean distortions
plt.plot(num_clusters, mean_distortions)
plt.title('Mean Distortion vs. k') #setting title of graph
plt.xlabel('k') #setting title of x-axis
plt.ylabel('Mean Distortion') #setting title of y-axis
plt.show(); #displaying graph

Observations:¶

According to the elbow method, the ideal k value seems to be either 6 or 7.

Calculating the Silhouette Scores of Different Numbers of Clusters¶

In [26]:
#setting clusters from 2 to 10 (not including 10)
clusters_num = range(2,10)
silhouette_scores = [] #setting silhouette_scores to empty list

for k in clusters_num: #for each value in clusters_num
    model1 = KMeans(n_clusters=k) #assign model1 the KMeans function for that number of clusters
    prediction1 = model1.fit_predict(scaled_stocks) #make prediction using scaled_stocks values
    score = silhouette_score(scaled_stocks, prediction1) #calculate silhouette score using prediction
    silhouette_scores.append(score) #add score to silhouette_scores
    #print k and the corresponding silhouette score
    print(k, 'Clusters', 'Silhouette Score:', score)
2 Clusters Silhouette Score: 0.43969639509980457
3 Clusters Silhouette Score: 0.45755884975007327
4 Clusters Silhouette Score: 0.45483520750820555
5 Clusters Silhouette Score: 0.4033714342513622
6 Clusters Silhouette Score: 0.42287350755988
7 Clusters Silhouette Score: 0.4179608494109058
8 Clusters Silhouette Score: 0.40232990858584977
9 Clusters Silhouette Score: 0.41161415393845907

Plotting Silhouette Scores¶

In [27]:
#creating a plot for number of clusters and corresponding silhouette score
plt.plot(clusters_num, silhouette_scores)
plt.title('Silhouette Scores vs. k') #setting title of graph
plt.xlabel('k') #setting title of x-axis
plt.ylabel('Silhouette Score') #setting title of y-axis
plt.show(); #displaying graph

Observations:¶

According to the Silhouette scores plot, the ideal k value seems to be 7.
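For reference, the silhouette coefficient of a single sample is s = (b − a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points in the nearest other cluster. A minimal sketch on a toy 1-D example (the point values are made up):

```python
# toy example: sample x sits in cluster A, with cluster B far away
cluster_a = [0.0, 1.0]
cluster_b = [10.0, 11.0]
x = cluster_a[0]

others = cluster_a[1:]  # every other point in x's own cluster
a = sum(abs(x - p) for p in others) / len(others)     # mean intra-cluster distance
b = sum(abs(x - p) for p in cluster_b) / len(cluster_b)  # mean distance to nearest other cluster
s = (b - a) / max(a, b)

print(round(s, 3))  # 0.905 -- close to 1, so x is well matched to its own cluster
```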

Using Silhouette Coefficients to Find the Ideal k¶

In [28]:
#assigning visualizer to the SilhouetteVisualizer function where KMeans has 7 clusters
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(scaled_stocks) #fitting visualizer to scaled_stocks values
visualizer.show() #displaying visualization
Out[28]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [29]:
#assigning visualizer to the SilhouetteVisualizer function where KMeans has 6 clusters
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(scaled_stocks) #fitting visualizer to scaled_stocks values
visualizer.show() #displaying visualization
Out[29]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Observations and Choosing the Final k:¶

  • k values of 6 and 7 have nearly the same average silhouette score (about 0.42), and the elbow curve indicates that both can be good values. However, the plot of silhouette scores shows a steeper drop after 7.

  • It seems best to proceed with 7 as the k value.

Creating Model with k Value of 7¶

In [30]:
#adding the KMeans function with 7 clusters into a variable called kmeans_model
kmeans_model = KMeans(n_clusters=7, random_state=0)
#fitting the scaled_stocks values to the kmeans_model
kmeans_model.fit(scaled_stocks)
Out[30]:
KMeans(n_clusters=7, random_state=0)
In [31]:
#adding a new column called kmeans_cluster to the original stocks
stocks['kmeans_cluster'] = kmeans_model.labels_

Cluster Profiling¶

In [32]:
#saving the mean of each numerical column for each cluster into the clusters variable
clusters = stocks.groupby('kmeans_cluster').mean(numeric_only=True)
#adding a Total Values column to clusters with the number of stocks in each cluster
clusters['Total Values'] = stocks.groupby('kmeans_cluster')['Current Price'].count()
In [33]:
#reassigning numerical columns of stocks to stocks_num
stocks_num = stocks.select_dtypes(np.number)
#creating a boxplot for each column in stocks_num vs. cluster
for i in stocks_num.columns: #for each column in stocks_num
    sns.boxplot(data=stocks_num, x='kmeans_cluster', y=i) #plotting clusters on x-axis and column on y
    plt.title(i + ' vs. Clusters') #setting title of boxplot
    plt.xlabel('Clusters') #setting title of x-axis
    plt.ylabel(i) #setting title of y-axis
    plt.show(); #displaying boxplot
In [34]:
#printing the GICS sectors included in each cluster
for c in range(7): #for each cluster label
    print(f'Cluster {c + 1}:\n', stocks[stocks['kmeans_cluster'] == c]['GICS Sector'].unique())
Cluster 1:
 ['Financials' 'Consumer Discretionary' 'Health Care'
 'Information Technology' 'Consumer Staples' 'Telecommunications Services'
 'Energy']
Cluster 2:
 ['Industrials' 'Health Care' 'Consumer Staples' 'Utilities' 'Financials'
 'Real Estate' 'Information Technology' 'Materials'
 'Consumer Discretionary' 'Telecommunications Services' 'Energy']
Cluster 3:
 ['Energy']
Cluster 4:
 ['Industrials' 'Consumer Discretionary' 'Consumer Staples' 'Financials']
Cluster 5:
 ['Information Technology' 'Consumer Discretionary' 'Health Care']
Cluster 6:
 ['Information Technology' 'Health Care' 'Real Estate'
 'Telecommunications Services' 'Energy' 'Consumer Discretionary'
 'Consumer Staples' 'Materials']
Cluster 7:
 ['Energy' 'Industrials' 'Materials' 'Information Technology']

Observations:¶

  • Cluster 5 has higher current prices than the other clusters.
  • Clusters 3 and 7 have mostly negative percent changes in price, compared to the other clusters.
  • Volatility is quite distinct across clusters; the widest volatility range is seen in cluster 3.
  • Cluster 3 also has a very high and broad range of ROE.
  • The cash ratio is highest for cluster 6.
  • Cluster 1 has the broadest range of net cash flow and the highest net income.
  • Cluster 3 has the lowest net income and the lowest earnings per share.
  • Cluster 1 has the highest estimated shares outstanding.
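The boxplot readings above can be cross-checked with per-cluster medians. A minimal sketch on a toy frame (the column values here are synthetic stand-ins; the real check would run on `stocks` with its actual columns):

```python
import numpy as np
import pandas as pd

# toy stand-in for the clustered stocks dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'kmeans_cluster': rng.integers(0, 7, size=100),
    'Current Price': rng.uniform(10, 500, size=100),
    'Net Income': rng.normal(1e9, 5e8, size=100),
})

# per-cluster medians back the visual boxplot reading with numbers
profile = toy.groupby('kmeans_cluster')[['Current Price', 'Net Income']].median()
print(profile)
```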

Cluster 1:

  • Financials, Consumer Discretionary, Health Care, Information Technology, Consumer Staples, Telecommunications Services, Energy
  • lower current price
  • low, positive percent price change
  • broad net cash flow
  • high net income
  • very high estimated shares outstanding

Cluster 2:

  • Industrials, Health Care, Consumer Staples, Utilities, Financials, Real Estate, Information Technology, Materials, Consumer Discretionary, Telecommunications Services, Energy
  • lower current price
  • positive percent price change

Cluster 3:

  • Energy
  • lower current prices
  • mostly negative percent price change
  • very high volatility
  • high ROE
  • very low net income
  • very low earnings per share

Cluster 4:

  • Industrials, Consumer Discretionary, Consumer Staples, Financials
  • moderate current prices
  • positive percent price change
  • high ROE

Cluster 5:

  • Information Technology, Consumer Discretionary, Health Care
  • very high current prices
  • positive percent price change
  • broad range of cash ratio
  • slightly higher earnings per share
  • broad range of P/B ratio

Cluster 6:

  • Information Technology, Health Care, Real Estate, Telecommunications Services, Energy, Consumer Discretionary, Consumer Staples, Materials
  • lower current prices
  • higher, positive percent price change
  • higher and broader range for cash ratio

Cluster 7:

  • Energy, Industrials, Materials, Information Technology
  • lower current prices
  • negative percent price change
  • high volatility

Hierarchical Clustering¶

Finding Cophenetic Correlation for Each Distance Metric and Linkage Method¶

In [35]:
#adding the different distance metric methods into distances
distances=['mahalanobis', 'euclidean', 'cityblock', 'chebyshev']
#adding the different linkage methods into linkage
linkages = ['complete', 'weighted', 'single', 'average']
high_i_l = [0,0]
high_correlation = 0

for i in distances: #for each value in distances
    for l in linkages: #for each value in linkages
        Z = linkage(scaled_stocks, metric=i, method=l) #use the distance metric and linkage method on scaled_stocks values
        c, coph_dists = cophenet(Z, pdist(scaled_stocks)) #find the cophenetic correlation of the values in scaled_stocks
        #print the cophenetic correlation, distance metric, and linkage values
        print('Cophenetic Correlation:', c, 'Distance:', i, 'Linkage:', l)
        if high_correlation < c:
            high_correlation = c
            high_i_l[0] = i
            high_i_l[1] = l
Cophenetic Correlation: 0.7925307202850002 Distance: mahalanobis Linkage: complete
Cophenetic Correlation: 0.8708317490180428 Distance: mahalanobis Linkage: weighted
Cophenetic Correlation: 0.9259195530524591 Distance: mahalanobis Linkage: single
Cophenetic Correlation: 0.9247324030159737 Distance: mahalanobis Linkage: average
Cophenetic Correlation: 0.7873280186580672 Distance: euclidean Linkage: complete
Cophenetic Correlation: 0.8693784298129404 Distance: euclidean Linkage: weighted
Cophenetic Correlation: 0.9232271494002922 Distance: euclidean Linkage: single
Cophenetic Correlation: 0.9422540609560814 Distance: euclidean Linkage: average
Cophenetic Correlation: 0.7375328863205818 Distance: cityblock Linkage: complete
Cophenetic Correlation: 0.731045513520281 Distance: cityblock Linkage: weighted
Cophenetic Correlation: 0.9334186366528574 Distance: cityblock Linkage: single
Cophenetic Correlation: 0.9302145048594667 Distance: cityblock Linkage: average
Cophenetic Correlation: 0.598891419111242 Distance: chebyshev Linkage: complete
Cophenetic Correlation: 0.9127355892367 Distance: chebyshev Linkage: weighted
Cophenetic Correlation: 0.9062538164750717 Distance: chebyshev Linkage: single
Cophenetic Correlation: 0.9338265528030499 Distance: chebyshev Linkage: average

Observations:¶

The highest cophenetic correlation is 0.942, obtained with the Euclidean distance metric and the average linkage method.

Finding Cophenetic Correlation for Euclidean Distance Metric and Each Linkage Method/Also Creating Dendrograms for Each Linkage Method¶

In [36]:
#adding the different linkage methods into linkage
linkages = ['single', 'weighted', 'ward', 'centroid', 'complete', 'average']
high_i_l = [0,0]
high_correlation = 0

for l in linkages: #for each value in linkages
    Z = linkage(scaled_stocks, metric='euclidean', method=l) #use the euclidean distance metric and linkage method on scaled_stocks values
    c, coph_dists = cophenet(Z, pdist(scaled_stocks)) #find the cophenetic correlation of the values in scaled_stocks
    #print the cophenetic correlation and linkage values
    print('Cophenetic Correlation:', c, 'Linkage:', l)
    plt.figure(figsize=(10, 5)) #setting figure size to (10,5)
    plt.title('Dendrogram for Linkage: ' + l) #setting title of dendrogram
    plt.grid(False) #not displaying grid
    dendrogram(Z) #making dendrogram
    plt.show() #displaying dendrogram
    if high_correlation < c:
        high_correlation = c
        high_i_l[0] = 'euclidean'
        high_i_l[1] = l
Cophenetic Correlation: 0.9232271494002922 Linkage: single
Cophenetic Correlation: 0.8693784298129404 Linkage: weighted
Cophenetic Correlation: 0.7101180299865353 Linkage: ward
Cophenetic Correlation: 0.9314012446828154 Linkage: centroid
Cophenetic Correlation: 0.7873280186580672 Linkage: complete
Cophenetic Correlation: 0.9422540609560814 Linkage: average

Observations:¶

  • 0.942 is still the highest cophenetic correlation, achieved with the Euclidean distance metric and the average linkage method.
  • According to the dendrogram, the ideal number of clusters seems to be about 7 clusters.
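Cutting the dendrogram into a chosen number of flat clusters can also be done programmatically with scipy's `fcluster`. A minimal sketch on synthetic stand-in data, using the same settings as the chosen model (Euclidean distance, average linkage, 7 clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# synthetic stand-in for scaled_stocks
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))

# same settings as the chosen model: euclidean distance, average linkage
Z = linkage(X, metric='euclidean', method='average')

# cut the tree into at most 7 flat clusters (fcluster labels start at 1)
labels = fcluster(Z, t=7, criterion='maxclust')
print('clusters found:', np.unique(labels))
```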

Creating Final Hierarchical Clustering Model¶

In [37]:
#building a hierarchical clustering model using 7 clusters, euclidean distance, and average linkage
#note: scikit-learn >= 1.2 renames the affinity parameter to metric
hier_model = AgglomerativeClustering(n_clusters=7, affinity='euclidean', linkage='average')
hier_model.fit(scaled_stocks) #fitting the model to scaled_stocks
Out[37]:
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=7)
In [38]:
#adding the clusters to a new column called hier_clusters in stocks dataframe
stocks['hier_cluster'] = hier_model.labels_

Cluster Profiling¶

In [39]:
#reassigning numerical columns of stocks to stocks_num
stocks_num = stocks.select_dtypes(np.number)
#creating a boxplot for each column in stocks_num vs. cluster
for i in stocks_num.columns: #for each column in stocks_num
    sns.boxplot(data=stocks_num, x='hier_cluster', y=i) #plotting clusters on x-axis and column on y
    plt.title(i + ' vs. Clusters') #setting title of boxplot
    plt.xlabel('Clusters') #setting title of x-axis
    plt.ylabel(i) #setting title of y-axis
    plt.show(); #displaying boxplot
In [40]:
#printing the GICS sectors included in each cluster
for c in range(7): #for each cluster label
    print(f'Cluster {c + 1}:\n', stocks[stocks['hier_cluster'] == c]['GICS Sector'].unique())
Cluster 1:
 ['Energy']
Cluster 2:
 ['Financials' 'Information Technology']
Cluster 3:
 ['Health Care' 'Consumer Discretionary' 'Information Technology']
Cluster 4:
 ['Information Technology']
Cluster 5:
 ['Consumer Discretionary']
Cluster 6:
 ['Information Technology']
Cluster 7:
 ['Industrials' 'Health Care' 'Information Technology' 'Consumer Staples'
 'Utilities' 'Financials' 'Real Estate' 'Materials'
 'Consumer Discretionary' 'Energy' 'Telecommunications Services']

Observations:¶

  • Cluster 3 has a higher range of current prices than the other clusters.
  • Cluster 1 has the lowest range of percent price change, consisting mostly of negative values.
  • Cluster 1 also has the widest range of volatility and the highest ROE.
  • Cluster 2 has the highest net cash flow and the highest net income, while cluster 1 has the lowest net income and the lowest earnings per share.
  • Cluster 2 has the broadest range of estimated shares outstanding.
  • Cluster 3 has the highest P/E ratio.

Cluster 1:

  • Energy
  • lower current price
  • mostly negative percent price change
  • very high volatility
  • very high ROE
  • very low net income
  • very low earnings per share

Cluster 2:

  • Financials, Information Technology
  • lower current prices
  • positive percent price change
  • very high net cash flow and high net income
  • broader range of estimated shares outstanding

Cluster 3:

  • Health Care, Consumer Discretionary, Information Technology
  • slightly higher and broader range of current prices
  • higher, positive percent price change
  • moderate volatility
  • high P/E ratio

Cluster 4:

  • Information Technology
  • lower current prices
  • higher, positive percent price change
  • low ROE
  • highest cash ratio
  • high estimated shares outstanding

Cluster 5:

  • Consumer Discretionary
  • very high current price
  • low, positive percent price change
  • highest earnings per share

Cluster 6:

  • Information Technology
  • moderate current price
  • positive percent price change
  • very high P/B ratio

Cluster 7:

  • Industrials, Health Care, Information Technology, Consumer Staples, Utilities, Financials, Real Estate, Materials, Consumer Discretionary, Energy, Telecommunications Services
  • lower current prices
  • mostly positive percent price change
  • broader range of volatility

K-means vs Hierarchical Clustering¶

Similarities in Both Techniques

  • They both took a similar amount of time to perform.
  • In both techniques, the ideal number of clusters was 7.

Differences in Both Techniques

  • The K-means technique produced more distinct clusters: most clusters had clearly different characteristics, so there was something to say about each. In hierarchical clustering, one or two clusters carried most of the distinguishing characteristics, and there was little to describe about the rest.
  • In terms of GICS sectors, hierarchical clustering produced more specific clusters. In K-means, most clusters spanned many GICS sectors and only a few contained a small number. In hierarchical clustering, only cluster 7 spanned many sectors; the other clusters contained just one to three.
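The agreement between the two clusterings can also be quantified, for example with the adjusted Rand index. A minimal sketch with synthetic label vectors (the real comparison would pass `stocks['kmeans_cluster']` and `stocks['hier_cluster']`):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# synthetic stand-ins for the two label columns (340 stocks, 7 clusters each)
rng = np.random.default_rng(0)
kmeans_labels = rng.integers(0, 7, size=340)
hier_labels = kmeans_labels.copy()
hier_labels[:40] = rng.integers(0, 7, size=40)  # perturb to mimic partial disagreement

# ARI = 1 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(kmeans_labels, hier_labels)
print('adjusted Rand index:', ari)
```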

Similarities in Cluster Profiles

K-means Cluster 3, K-means Cluster 7, and Hierarchical Cluster 1

  • low current price
  • negative percent price change
  • high volatility

K-means Cluster 6 and Hierarchical Cluster 4

  • low current price
  • higher, positive percent price change
  • higher or broader range of cash ratio

K-means Cluster 1 and Hierarchical Cluster 2

  • low current price
  • positive percent price change
  • high net income
  • higher or broader range of estimated shares outstanding
  • higher or broader range of net cash flow

K-means Cluster 5 and Hierarchical Cluster 5

  • very high current prices
  • positive percent price change
  • very high earnings per share

K-means Cluster 4 and Hierarchical Cluster 6

  • moderate current price
  • positive percent price change

K-means Cluster 2 and Hierarchical Cluster 7

  • lower current price
  • positive percent price change
  • broader category of GICS Sectors

Differences in Cluster Profiles

Hierarchical cluster 3 did not map onto any K-means cluster because its combination of characteristics was too different: a slightly higher and broader range of current prices, a higher and more positive percent price change, moderate volatility, and a high P/E ratio.

Actionable Insights and Recommendations¶

According to the cluster profiles provided:

  • Stocks in K-means cluster 6 and hierarchical cluster 4 are low in price but showed a high positive percent change in price over the past 13 weeks. For investors willing to take only a small risk and invest a small amount of money, these stocks are cheap now and might continue to move in a positive direction.
  • K-means cluster 5 and hierarchical cluster 5 have very high current prices but also high earnings per share. These stocks suit investors willing to commit more money; relative to the companies' common shares, their net profit is in good standing.
  • K-means cluster 1 and hierarchical cluster 2 have high net income and a high or broad range of both estimated shares outstanding and net cash flow.

In conclusion, the most profitable or stable stocks to invest in are considered to be:

  • K-means Cluster 1
  • K-means Cluster 5
  • K-means Cluster 6
  • Hierarchical Cluster 2
  • Hierarchical Cluster 4
  • Hierarchical Cluster 5